library(ipumsr)
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(corrplot)
library(VGAM)
library(rpart)
library(rpart.plot)
library(hardhat)
library(ranger)
library(recipes)
library(srvyr)
library(haven)
library(survey)
library(tidyclust)
library(GGally)
library(patchwork)
library(dotenv)
library(httr)
library(jsonlite)
library(tidycensus)
library(tigris)
library(lubridate)
library(patchwork)
library(stringr)
library(dplyr)
library(themis)
library(parsnip)
library(rsample)
library(yardstick)
library(vip)
library(kknn)
library(caret)
library(httr)
library(jsonlite)
library(lubridate)
library(sf)
library(readr)
library(stringr)
library(janitor)
library(recipes)
library(RColorBrewer)
set.seed(20231215)Data Science for Public Policy
Final Project
Final Project
The Importance of Health Care Access and Economic Stability for Immigrant Communities in the United States
Immigration is deeply embedded into the history of the United States. For centuries, the arrival of individuals born abroad has brought new ideas, cultures, and economic/social benefits to the nation. Despite this, there is evidence that many of these individuals experience additional barriers to accessing the economic and social opportunities that exist, and experience high economic and health disparities as compared to those born in the United States.
Research delving into the historical health and economic disparities that face immigrant communities has focused on limited access to health insurance coverage and negative economic outcomes related to poverty, like food insecurity and housing instability (Chang, 2019). According to Derose et. al., immigrants “have lower rates of health insurance, use less health care, and receive lower quality of care than U.S.-born populations” (2007). “Foreign-born residents of America’s suburbs experienced markedly higher poverty rates (14.1 percent) than the U.S. born (9.8 percent) in 2009,” according to Suro et. al. (2011). Two factors that contribute to physical well being is access to health insurance and access to health centers. Health insurance facilitates access to care and is associated with lower death rates, better health outcomes, and improved productivity. U.S. residents obtain health coverage from a variety of private and public sources, such as through their employers or direct purchase on the individual market (private sources), as well as through the Medicare, Medicaid, or Veterans Affairs programs (public sources).
Additionally, the US immigrant population is not monolithic, and it is important to explore factors that can influence an individuals health or economic status such as documentation status or how many years they have spent in the US. The important of these factors has been illustrated by prior research: A May 2022 study by the Guttmacher Institute, used the 2019 American Community Survey and the 2017−2018 National Health Interview Survey to conclude that there exist disparities by immigration status in access to health insurance, as well as primary and reproductive health care. In another study that explored health insurance coverage of U.S. immigrants, Dr. Ku analyzed the differences between established immigrants versus recent immigrants (Ku, 2011). Recent immigrants were categorized as immigrants who had been in the U.S. for less than a decade, while established immigrants were categorized as immigrants who had been in the U.S. for over a decade. When comparing the health insurance status of these two groups, Ku found that only 44% of recent immigrants were fully covered by health insurance, where established immigrants were at 63%. These results reveal the impact time in America has on the insurance coverage of immigrants.
Research Question
Through exploratory data analysis, supervised and unsupervised machine learning, and geospatial analysis, this research project strives to explore factors that contribute to the health and economic status of immigrant communities. Specifically, we sought to understand health care access in immigrant communities, understanding how the year of arrival influences ones income levels, and exploring differentiating factors that could influence health/economic status outside of only being born in the United States. With an increased understanding of these factors, policy makers are more effectively address disparities that exist, as well as identify protective factors that support immigrants thriving within their communities.
Bibliography
“American Community Survey.” U.S. Census Bureau, [2022]. Census Bureau, U.S. Department of Commerce.
Chang, C. D. (2019). Social determinants of health and health disparities among immigrants and their children. Current problems in pediatric and adolescent health care, 49(1), 23-30.
Derose, K. P., Escarce, J. J., & Lurie, N. (2007). Immigrants and health care: sources of vulnerability. Health affairs, 26(5), 1258-1268.
Flood, Sarah, Miriam King, Renae Rodgers, Steven Ruggles, J. Robert Warren, Daniel Backman, Annie Chen, Grace Cooper, Stephanie Richards, Megan Schouweiler, and Michael Westberry. IPUMS CPS: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2023. https://doi.org/10.18128/D030.V11.0
Fuentes, L., Desai, S., and Dawson, R.. (2022). “New Analyses on US Immigrant Health Care Access Underscore the Need to Eliminate Discriminatory Policies.” Www.guttmacher.org, May. https://doi.org/10.1363/2022.33551.
Ku, Leighton. (2011). Health Insurance Coverage and Medical Expenditures of Immigrants and Native-Born Citizens in the United States.American Journal Of Public Health, 99(7), 1322-1328.
Pourat, N., Wallace, S. P., Hadler, M. W., & Ponce, N. (2014). Assessing health care services used by California’s undocumented immigrant population in 2010. Health Affairs, 33(5), 840-847
Schumacher, S., and Presiado, M. (2023). “Health and Health Care Experiences of Immigrants: The 2023 KFF/LA Times Survey of Immigrants.” KFF. September 17, 2023. https://www.kff.org/racial-equity-and-health-policy/issue-brief/health-and-health-care-experiences-of-immigrants-the-2023-kff-la-times-survey-of-immigrants/.
Suro, R., Wilson, J. H., & Singer, A. (2011). Immigration and poverty in America’s suburbs. Metropolitan Opportunity Series, 1-21.
U.S. Census Bureau. (2022). 2017-2022 American Community Survey 5-year Public Use County Samples
US Health Resources and Service Administration (2023). 2023 Federally Qualified Health Centers and Look-Alikes Data.
US Census Bureau (2023). 2007-2023 TIGER/Line Shapefiles
2. Data Sources
Library
Current Population Survey - 2022
We are utilizing the Current Population Survey from 2022 as a major source of data for our analysis to explore factors that would contribute to an immigrant’s income or health status. This survey is conducted each year to capture statistical information on the population. The units of measurement are individual adults. The survey contains information on immigration demographic characteristics in addition to health insurance status. Below, we will utilize unsupervised machine learning techniques to analyze the relationship between health and demographic characteristics and estimate a model, given that the survey has many correlated predictors. We will also use the data in a supervised machine learning capacity to predict health coverage status based on several demographic predictors.
We used the IPUMS API to download the 2022 CPS data. We read in the data using the ipumsr package. To read in the data, an IPUMS API key is required and read in using the dotevn package.
#using IPUMS API to access CPS ASEC 2022 data
dotenv::load_dot_env(file = ".env")
# Access the values
ipums_key <- Sys.getenv("ipums_api_key")
cps_extract_request <- define_extract_cps(
description = "IPUMS-CPS, ASEC 2022",
samples = c("cps2022_03s"),
variables = c("CPSID", "ASECFLAG", "ASECWTH", "PERNUM",
"CPSIDV", "CPSIDP", "ASECWT", "AGE",
"SEX", "RACE", "MARST", "POPSTAT", "ASIAN",
"FAMSIZE", "NCHILD", "NCHLT5", "BPL", "YRIMMIG",
"CITIZEN", "MBPL", "FBPL", "NATIVITY", "HISPAN",
"EMPSTAT", "LABFORCE", "EDUC", "FTOTVAL", "INCTOT",
"INCWAGE", "OFFPOV", "MIGSTA1", "MIGRATE1",
"HIMCAIDNW", "HIMCARENW", "ANYCOVNW", "ANYCOVLY", "HHINCOME", "PRVTCOVNW",
"GRPCOVNW", "DPCOVNW", "MRKCOVNW", "TRCCOVNW", "INHCOVNW")
)
submitted_extract <- submit_extract(extract = cps_extract_request, api_key = Sys.getenv("ipums_api_key"))
downloadable_extract <- wait_for_extract(extract = submitted_extract, api_key = Sys.getenv("ipums_api_key"))
data_files <- download_extract(downloadable_extract, api_key = Sys.getenv("ipums_api_key"))
|
| | 0%
|
|== | 3%
|
|==== | 6%
|
|============= | 19%
|
|================ | 23%
|
|======================== | 35%
|
|================================== | 49%
|
|=========================================== | 61%
|
|==================================================== | 74%
|
|======================================================================| 100%
|
| | 0%
|
|= | 1%
|
|== | 3%
|
|===== | 7%
|
|====== | 8%
|
|====== | 9%
|
|======= | 10%
|
|======== | 12%
|
|========= | 13%
|
|========== | 14%
|
|============= | 18%
|
|=============== | 21%
|
|=============== | 22%
|
|================ | 23%
|
|================= | 24%
|
|================== | 25%
|
|=================== | 27%
|
|==================== | 29%
|
|======================= | 33%
|
|======================== | 35%
|
|========================= | 35%
|
|========================= | 36%
|
|========================== | 37%
|
|=========================== | 39%
|
|============================ | 40%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 43%
|
|================================= | 47%
|
|================================= | 48%
|
|=================================== | 50%
|
|==================================== | 51%
|
|====================================== | 54%
|
|======================================= | 56%
|
|======================================== | 58%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 61%
|
|=========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 62%
|
|============================================ | 63%
|
|============================================= | 65%
|
|============================================== | 66%
|
|=============================================== | 67%
|
|================================================ | 68%
|
|================================================ | 69%
|
|================================================= | 70%
|
|================================================== | 71%
|
|================================================== | 72%
|
|=================================================== | 73%
|
|==================================================== | 74%
|
|===================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 79%
|
|======================================================== | 80%
|
|========================================================== | 82%
|
|=========================================================== | 84%
|
|=========================================================== | 85%
|
|============================================================ | 86%
|
|============================================================= | 87%
|
|============================================================== | 89%
|
|=============================================================== | 89%
|
|=============================================================== | 90%
|
|================================================================ | 92%
|
|================================================================= | 92%
|
|================================================================== | 94%
|
|=================================================================== | 96%
|
|===================================================================== | 98%
|
|======================================================================| 100%
file_number <- downloadable_extract[["number"]]
ddi_file <- paste0("cps_000", file_number, ".xml")
#Loading in the CPS 2022 data from IPUMS API
ddi <- read_ipums_ddi(ddi_file)
cps <- read_ipums_micro(ddi)American Communities Survey (ACS) 2017-2022, County-Level Selected Demographics
Our team utilized the 2022, 5-year survey American Communities Survey to analyze data on select demographic features of US counties. Particular variables of interest were overall population size of counties, as well as percent of counties that were born abroad and percent of individuals that identified with various demographic groups (race, nativity, ethnicity). Additionally, we pulled data on the percent of individuals with access to health insurance, percent employed, and percent with access to internet.
# loading data
variables <- tidycensus::load_variables(2022, "acs5")
acs_2022 <-
get_acs(
geography = "county",
variables = c(
"B01003_001", # = Total Population
"DP02_0090PE", # = % born in the US
"DP02_0093PE", # = % born in PR
"DP02_0094PE", # = % foreign born
"DP02_0100PE", # = % Native - Entered 2010 or later
"DP02_0101PE", # = % Native - Entered before 2010
"DP02_0103PE", # = % Foreign - Entered 2010 or later
"DP02_0104PE", # = % Foreign - Entered before 2010
"DP02_0106PE", # = % Foreign - Europe
"DP02_0107PE", # = % Foreign - Asia
"DP02_0108PE", # = % Foreign - Africa
"DP02_0109PE", # = % Foreign - Oceania
"DP02_0110PE", # = % Foreign - Latin America
"DP02_0111PE", # = % Foreign - North America
"DP05_0037PE", #= % white
"DP05_0038PE", # = % black
"DP05_0039PE", # = % american indian
"DP05_0044PE", # = % Asian
"DP05_0052PE", # = % Native american pacific islander
"DP02_0065PE", # = % with a bachelors or above
"DP03_0062E", # = Median houshold income
"DP02_0152PE", # = % that uses internet use
"DP03_0004PE", # = % employed
"DP03_0095PE" # = % with health care coverage
),
geometry = TRUE,
year = 2022,
survey = "acs5",
progress = FALSE
) %>%
janitor::clean_names() %>%
mutate(
variable = case_when(
variable == "B01003_001" ~ "total_pop",
variable == "DP02_0090P" ~ "born_us",
variable == "DP02_0093P" ~ "born_pr",
variable == "DP02_0094P" ~ "born_abroad",
variable == "DP02_0100P" ~ "native_after_2010",
variable == "DP02_0101P" ~ "native_before_2010",
variable == "DP02_0103P" ~ "foreign_after_2010",
variable == "DP02_0104P" ~ "foreign_before_2010",
variable == "DP02_0106P" ~ "nativity_europe",
variable == "DP02_0107P" ~ "nativity_asia",
variable == "DP02_0108P" ~ "nativity_africa",
variable == "DP02_0109P" ~ "nativity_oceania",
variable == "DP02_0110P" ~ "nativity_latin",
variable == "DP02_0111P" ~ "nativity_north_amer",
variable == "DP05_0037P" ~ "race_white",
variable == "DP05_0038P" ~ "race_black",
variable == "DP05_0039P" ~ "race_native_amer",
variable == "DP05_0044P" ~ "race_asian",
variable == "DP05_0052P" ~ "race_pacific",
variable == "DP02_0065P" ~ "educ_bachelors",
variable == "DP03_0062" ~ "median_income",
variable == "DP02_0152P" ~ "internet_access",
variable == "DP03_0004P" ~ "employed",
variable == "DP03_0095P"~ "health_insurance"
)) %>%
select(- "moe") %>%
#unpivoting columns for each variable
pivot_wider(names_from = "variable", values_from = "estimate")Health Resources and Services Administration (HRSA) Community Health Center Data
In order to understand the access to health care through access to community health centers, we are utilizing US Health and Services Administration comprehensive listing of federally-qualified community health centers, or unregistered equivalent organizations. The dataset captures data at the “center”-level, where each observation represents a different health center. It is updated on an annual basis, and our team utilized the most recently available information. The dataset includes variables such as address, congressional representative, and latitude and longitude for each center.
#importing from Health and Services Administration
hrsa_health_centers <- read_csv( paste0("https://data.hrsa.gov//DataDownload/DD_Files/","Health_Center_Service_Delivery_and_LookAlike_Sites.csv" )) %>%
janitor::clean_names()
# filtering out Y coordinate rows with Y
hrsa_health_centers <- hrsa_health_centers %>%
filter(geocoding_artifact_address_primary_y_coordinate != "Y") %>%
filter(geocoding_artifact_address_primary_y_coordinate != "N") US Census County Shapefiles
In order to set our analysis up for geospacial analysis and an assessment of community health centers by US counties, we utilized the US Census’ Shapefiles available at the county level. This dataset was utilized to join together the American Communities Survey and Heath Resources and Services Administration’s dataset so that information could be aggregated at the county-level. Please note, the following section of code may need to be run twice in order for the “COUNTY” file to be downloaded and created.
# URL to the shapefile zip file
county_url <- "https://www2.census.gov/geo/tiger/TIGER2022/COUNTY/tl_2022_us_county.zip"
# Specify the destination folder where you want to save the downloaded file
destination_folder <- "COUNTY/tl_2022_us_county.zip"
# Download the shapefile zip file
download.file(county_url, destfile = file.path(destination_folder, "tl_2022_us_county.zip"), mode = "wb")
# Unzip the downloaded file
unzip(file.path(destination_folder, "tl_2022_us_county.zip"), exdir = destination_folder)
# Read the shapefile into R
shapefile <- sf::st_read(file.path(destination_folder, "tl_2022_us_county.shp")) %>%
janitor::clean_names()%>%
select(geometry,geoid) Reading layer `tl_2022_us_county' from data source
`/Users/oliviagomez/Desktop/Intro to Data Science/final_project/COUNTY/tl_2022_us_county.zip/tl_2022_us_county.shp'
using driver `ESRI Shapefile'
Simple feature collection with 3235 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
Geodetic CRS: NAD83
3. Data Wrangling and Exploratory Analysis
CPS Data Wrangling and Exploratory Analysis
# remove variables that won't be used for analysis
cps <- cps %>%
janitor::clean_names() %>%
select(-year, -month, -mbpl, -fbpl, -asian, -asecflag, -cpsidv)
# create variable for health insurance status
cps <- cps %>%
mutate(healthinsu = case_when(
himcaidnw == 1 ~ "medicaid",
himcarenw == 1 ~ "medicare",
prvtcovnw == 1 ~ "private",
grpcovnw == 1 ~ "employment based",
dpcovnw == 1 ~ "direct purchase",
mrkcovnw == 1 ~ "marketplace",
trccovnw == 1 ~ "tricare",
inhcovnw == 1 ~ "indian health service",
TRUE ~ NA_character_
))
# ensuring each category is labeled
cps <- cps %>%
mutate(healthinsu = factor(healthinsu, levels = c(
"medicaid", "medicare", "private", "employment based",
"direct purchase", "marketplace", "tricare",
"indian health service"
), labels = c(
"Medicaid", "Medicare", "Private", "Employment Based",
"Direct Purchase", "Marketplace", "Tricare",
"Indian Health Service"
)))
# removing original dummy variables for health insurance status
cps <- cps %>%
filter(!is.na(anycovnw)) %>%
select(-himcaidnw, -himcarenw, -prvtcovnw, -grpcovnw, -dpcovnw, -mrkcovnw, -trccovnw, -inhcovnw) %>%
mutate(healthinsu = as_factor(healthinsu))
#Remove NAs or NIUs coded as numbers
cps <- subset(cps, anycovly != 99)
cps <- subset(cps, empstat != 00)
cps <- subset(cps, hhincome != 99999999)
cps <- subset(cps, sex != 9)
cps <- subset(cps, race != 999)
cps <- subset(cps, marst != 9)
cps <- subset(cps, bpl != 99999)
cps <- subset(cps, citizen != 9)
cps <- subset(cps, hispan != 902)
cps <- subset(cps, labforce != 0)
cps <- subset(cps, educ != 001)
cps <- subset(cps, educ != 999)
cps <- subset(cps, ftotval != 9999999999)
cps <- subset(cps, inctot != 9999999999)
cps <- subset(cps, inctot != 9999999998)
cps <- subset(cps, incwage != 9999999999)
cps <- subset(cps, incwage != 9999999998)
cps <- subset(cps, offpov != 99)
cps <- subset(cps, migsta1 != 00)
cps <- subset(cps, migrate1 != 0)
# dropping popstat variable as it has no variance
cps <- cps %>%
select(-popstat)
# final check for any NAs
cps_na <- is.na(cps)
missing_counts <- colSums(cps_na)
print(missing_counts[missing_counts > 0])named numeric(0)
ACS, HRSA, and Census Data Wrangling
Assessing Missingness:
- The HRSA dataset was one of the messier datasets we worked with, and it had a number of data missing across a range of variables. However, we prioritized the variables that would be of core importance to the analysis. While a number of health centers didn’t have a phone number listed, for example, we didn’t feel that this justified removing it all together.
It is important to note that the HRSA dataset, at times, inconsistently pulled longitude and latitude variables, likely due to regular updates to the data that are being made. In total there were around 6,000 observations that either contained null values in the longitude and latitude fields, or had the values “N” and “Y”. Regarding the latter, one could assume that HRSA’s data input shifted column values from the immediate variables to the right of latitude and longitude columns, as the only valid values for this field were “N” and “Y”. We removed these values to ensure the analysis consistently pulls metrics.
The ACS dataset is a well-prepared data source provided by the US Census Bureau and therefore required minimal cleaning. Similar to above, there were a relatively small number of missing values for the most consequential variables: born_abroad, nativity fields, and born_us. There were 78 places in which all born_us, born_abroad, and born_pr were missing. Upon further investigation we found that all of these variables were for Puerto Rican counties. Given that strong analysis could not be conducted for Puerto Rico for this reason, these observations were dropped. Additionally, other variables missing from this dataset were due to a “logic” built into the preparation of the data. For example, if a county reported that 100% of individuals were us born, then nativity fields were left with null values. These observations were not touched.
Similar to the ACS, the Census Shapefiles are extremely ‘tidy’ datasets. There were no missing values with the Shapefiles.
##HRSA Exploratory Analysis
hrsa_na <- is.na(hrsa_health_centers)
hrsa_missing_counts <- colSums(hrsa_na)
print(hrsa_missing_counts[hrsa_missing_counts > 0]) health_center_number
1
bhcmis_organization_identification_number
1
site_postal_code
5
site_telephone_number
18
site_web_address
4772
operating_hours_per_week
443
fqhc_site_medicare_billing_number
6172
fqhc_site_npi_number
7640
site_added_to_scope_this_date
1
health_center_name
1
health_center_organization_street_address
1
health_center_organization_city
1
health_center_organization_state
18
health_center_organization_zip_code
1
grantee_organization_type_description
445
migrant_health_centers_hrsa_grant_subprogram_indicator
445
community_health_hrsa_grant_subprogram_indicator
445
school_based_health_center_hrsa_grant_subprogram_indicator
445
public_housing_hrsa_grant_subprogram_indicator
445
health_care_for_the_homeless_hrsa_grant_subprogram_indicator
445
u_s_mexico_border_county_indicator
5
state_and_county_federal_information_processing_standard_code
5
county_equivalent_name
5
county_description
5
u_s_congressional_representative_name
22
name_of_u_s_senator_number_one
182
name_of_u_s_senator_number_two
182
x62
11287
##Special examination of latitude and longitude values
#longitudes
#sum(is.na(hrsa_health_centers$geocoding_artifact_address_primary_x_coordinate))
#x_cord_missing <- subset(hrsa_health_centers, is.na(geocoding_artifact_address_primary_x_coordinate)) %>%
#print(x_cord_missing)
#latitudes
#sum(is.na(hrsa_health_centers$geocoding_artifact_address_primary_y_coordinate))
# y_cord_missing <- subset(hrsa_health_centers, is.na(geocoding_artifact_address_primary_y_coordinate))
hrsa_health_centers <-
subset(hrsa_health_centers, !is.na(geocoding_artifact_address_primary_y_coordinate)) #y and x coordinates are missing from same observations, so can just eliminate based on one condition
hrsa_health_centers <-
subset(hrsa_health_centers, !is.na(geocoding_artifact_address_primary_x_coordinate))### ACS Missingness Examination
acs_na <- is.na(acs_2022)
acs_missing_counts <- colSums(acs_na)
print(acs_missing_counts[acs_missing_counts > 0]) educ_bachelors born_us born_pr born_abroad
78 78 78 78
native_after_2010 native_before_2010 foreign_after_2010 foreign_before_2010
169 169 93 93
nativity_europe nativity_asia nativity_africa nativity_oceania
93 93 93 93
nativity_latin nativity_north_amer internet_access median_income
93 93 78 1
# counties in Puerto Rico were not provided with estimate numbers
born_us_missing <- subset(acs_2022, is.na(born_us))
#removing Puerto Rico values
acs_2022 <-
subset(acs_2022, !is.na(born_us))#### Census Missingness Examination
geo_na <- is.na(shapefile)
geo_missing_counts <- colSums(geo_na)
print(geo_missing_counts[geo_missing_counts > 0])named numeric(0)
In regards to the data wrangling for these datasets, the ultimate goal was to create a completed, merged data frame that compiles demographic features of each county by the ACS, an aggregate number of health centers in each county from the HRSA, and the corresponding geospacial fields needed to visualize and analyze data at the county level. The following code displays the steps that were taken to do this:
Step 1: Converting HRSA Dataframe to Spacial Dataframe
#convert hrsa lat/long points to geometry
hrsa_health_centers_geo <-
hrsa_health_centers %>%
st_as_sf(coords = c("geocoding_artifact_address_primary_x_coordinate", "geocoding_artifact_address_primary_y_coordinate"), remove = FALSE, crs = 4269)Step 2: Join HRSA SF Dataframe with Census Shapefile, Creating a field for “County” within the HRSA dataset
# new data with geoid
hrsa_joined <-
st_join(hrsa_health_centers_geo, shapefile, join = st_within)Step 3: Group the HRSA SF file by county, creating a field that sums the number of health centers per county
# Create column that sums the total number of health centers
hrsa_grouped <-
hrsa_joined %>%
group_by(geoid) %>%
summarize(
total_centers = n())
#verify that sum of centers = number of rows in hrsa dataset
hrsa_grouped %>%
summarize(sum(total_centers)) %>%
print()Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOINT
Dimension: XY
Bounding box: xmin: -176.6583 ymin: -14.319 xmax: 167.737 ymax: 68.35013
Geodetic CRS: NAD83
# A tibble: 1 × 2
`sum(total_centers)` geometry
<int> <MULTIPOINT [°]>
1 11287 ((134.4761 7.3422), (134.4872 7.346), (-67.84656 46.1230…
Step 4: Join county-level HRSA SF with the ACS dataset, creating new completed Shapefile
#verify that acs data is the same CRS as the joined_data
st_crs(acs_2022)Coordinate Reference System:
User input: NAD83
wkt:
GEOGCRS["NAD83",
DATUM["North American Datum 1983",
ELLIPSOID["GRS 1980",6378137,298.257222101,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["latitude",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["longitude",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4269]]
#table(acs_2022$geoid)
acs_joined <-
st_join(acs_2022, hrsa_grouped, left = TRUE) %>%
#replace null values with 0
mutate(total_centers = ifelse(is.na(total_centers), 0, total_centers))
#Total centers in joined data set very closely matches that of the hrsa dataset, implies that left join is justified
acs_joined %>%
summarize(sum(total_centers)) %>%
knitr::kable(digits = 2)| sum(total_centers) | geometry |
|---|---|
| 11155 | MULTIPOLYGON (((-76.41964 3… |
Step 5: Create a non-shapefile version of prior dataset This was used for analysis that will use county-level health center data, but not creating visuals.
Note: In preparation for later analysis, a new variable was created via a left-join, that articulates what the predominant nativity of immigrants in a particular county was.
acs_joined_nonspac <- st_drop_geometry(acs_joined)
#creating a variable to indicate where, if there is an immigrant population, what nativity accounts for the highest proportion of immigrants
nativity_col <- acs_joined %>%
select(geoid.x,nativity_europe,nativity_asia,nativity_africa,nativity_oceania,nativity_latin,nativity_north_amer) %>%
pivot_longer(
cols = c(nativity_europe,nativity_asia,nativity_africa,nativity_oceania,nativity_latin,nativity_north_amer),
names_to = "top_nativity",
values_to = "percent"
) %>%
group_by(geoid.x) %>%
slice(which.max(percent))
#add "top_nativity" column to the final, non-spacial dataset
acs_joined_nonspac <-
left_join(acs_joined_nonspac,nativity_col,by = "geoid.x")
#remove null values that were caused due to a 0% immigrant population
acs_joined_nonspac <- acs_joined_nonspac %>%
mutate(top_nativity = ifelse(is.na(top_nativity),"no_immigrant", top_nativity)) 4. Data Analysis
General Trends within Immigrant Populations
The below analysis begins with an overarching look at the immigrant landscape in the United States, from its geographic distribution at the county level to median income levels and prevalence of health insurance. It then proceeds to unsupervised machine learning followed by supervised machine learning in order to predict the likelihood of immigrant health coverage.
The two maps below denote the employment rate and median income by counties in the United States, which, when considered in conjunction with further analysis on the immigrant community, can visually highlight certain counties with disparities for immigrant-heavy populations.
# split out states in acs_joined_nonspac
acs_by_state <- acs_joined_nonspac %>%
separate(name, c(NA, "State"), ",", convert = TRUE)
# table of all 50 states noting their U.S.-born and born abroad populations, median incomes, and prevelance of having health insurance
acs_by_state %>%
distinct(State, .keep_all = TRUE) %>%
group_by(State) %>%
summarise(State, sum(total_pop), mean(born_us), mean(born_pr), mean(born_abroad), mean(median_income), health_insurance) %>%
knitr::kable(digits = 2)| State | sum(total_pop) | mean(born_us) | mean(born_pr) | mean(born_abroad) | mean(median_income) | health_insurance |
|---|---|---|---|---|---|---|
| Alabama | 58761 | 95.5 | 1.9 | 2.6 | 68315 | 57443 |
| Alaska | 3407 | 63.4 | 4.2 | 32.3 | 79961 | 3401 |
| Arizona | 66054 | 98.1 | 0.3 | 1.6 | 37483 | 65317 |
| Arkansas | 17024 | 98.3 | 0.3 | 1.4 | 58695 | 16787 |
| California | 1663823 | 65.0 | 1.7 | 33.3 | 122488 | 1654106 |
| Colorado | 520149 | 83.1 | 1.7 | 15.1 | 86297 | 516857 |
| Connecticut | 977165 | 80.0 | 5.3 | 14.7 | 88190 | 963750 |
| Delaware | 182400 | 90.8 | 2.6 | 6.6 | 69278 | 178816 |
| District of Columbia | 670587 | 84.4 | 2.1 | 13.4 | 101722 | 661596 |
| Florida | 279729 | 86.4 | 2.8 | 10.8 | 57566 | 277077 |
| Georgia | 18441 | 93.5 | 0.2 | 6.3 | 43268 | 18081 |
| Hawaii | 202163 | 85.3 | 1.8 | 12.9 | 74238 | 200621 |
| Idaho | 497494 | 92.8 | 1.4 | 5.8 | 83881 | 488475 |
| Illinois | 65583 | 98.2 | 0.5 | 1.2 | 63767 | 64463 |
| Indiana | 35827 | 98.7 | 0.4 | 0.9 | 61731 | 35315 |
| Iowa | 7479 | 99.0 | 0.1 | 0.9 | 63172 | 7315 |
| Kansas | 12554 | 96.6 | 0.6 | 2.7 | 55059 | 12383 |
| Kentucky | 18887 | 96.9 | 0.3 | 2.9 | 49690 | 18634 |
| Louisiana | 57674 | 99.0 | 0.3 | 0.7 | 44977 | 56871 |
| Maine | 111532 | 94.2 | 1.0 | 4.8 | 64500 | 109996 |
| Maryland | 68161 | 97.4 | 0.6 | 2.0 | 55248 | 63226 |
| Massachusetts | 229436 | 89.4 | 1.2 | 9.4 | 90447 | 227064 |
| Michigan | 10238 | 97.6 | 0.8 | 1.5 | 50295 | 10151 |
| Minnesota | 15859 | 98.8 | 0.5 | 0.7 | 56406 | 15678 |
| Mississippi | 29425 | 94.0 | 0.7 | 5.4 | 37271 | 27454 |
| Missouri | 25299 | 95.5 | 0.3 | 4.2 | 51020 | 25084 |
| Montana | 9469 | 98.0 | 0.8 | 1.2 | 55867 | 9283 |
| Nebraska | 31143 | 93.6 | 0.4 | 6.0 | 61502 | 30912 |
| Nevada | 25409 | 91.9 | 1.4 | 6.7 | 69922 | 24084 |
| New Hampshire | 63914 | 95.8 | 1.0 | 3.2 | 80719 | 63100 |
| New Jersey | 274339 | 81.0 | 3.2 | 15.8 | 73113 | 271937 |
| New Mexico | 674692 | 88.4 | 1.4 | 10.2 | 62220 | 666806 |
| New York | 315041 | 87.6 | 1.0 | 11.4 | 78829 | 311626 |
| North Carolina | 171779 | 90.9 | 0.7 | 8.4 | 60866 | 170768 |
| North Dakota | 2190 | 97.2 | 0.1 | 2.7 | 57950 | 2114 |
| Ohio | 27509 | 98.3 | 0.7 | 1.0 | 46234 | 27253 |
| Oklahoma | 19726 | 97.3 | 0.4 | 2.3 | 44955 | 19532 |
| Oregon | 16685 | 97.6 | 0.5 | 1.9 | 51657 | 16260 |
| Pennsylvania | 104604 | 95.6 | 1.0 | 3.4 | 78975 | 103502 |
| Rhode Island | 50658 | 88.4 | 1.6 | 9.9 | 105875 | 49983 |
| South Carolina | 24368 | 97.1 | 0.3 | 2.6 | 49759 | 24186 |
| South Dakota | 2590 | 94.7 | 1.1 | 4.2 | 71490 | 2492 |
| Tennessee | 77337 | 95.8 | 0.7 | 3.5 | 60633 | 76296 |
| Texas | 58077 | 93.1 | 1.0 | 5.8 | 57445 | 45538 |
| Utah | 7102 | 93.4 | 0.2 | 6.3 | 80268 | 6897 |
| Vermont | 37434 | 94.0 | 1.2 | 4.9 | 85870 | 37165 |
| Virginia | 33367 | 90.9 | 1.7 | 7.4 | 52694 | 32858 |
| Washington | 20557 | 76.8 | 0.3 | 22.9 | 63105 | 20340 |
| West Virginia | 15527 | 99.1 | 0.5 | 0.4 | 44341 | 15360 |
| Wisconsin | 20730 | 96.9 | 0.6 | 2.5 | 55223 | 19918 |
| Wyoming | 37525 | 92.4 | 0.7 | 6.9 | 55887 | 37324 |
#US Map
#crop map to include only the continental US
us_cropped <- st_crop(acs_joined, xmin = -125, xmax = -50, ymin = 50, ymax = 20)
#map of employment rate
ggplot() +
geom_sf(data = us_cropped, aes(fill = employed), alpha = 1) +
scale_fill_gradient(low = "white",high ="darkgreen") +
labs(title = "Employment Rate of Counties",
fill = "Employment Rate") +
theme_void() #map of median income
ggplot() +
geom_sf(data = us_cropped, aes(fill = median_income), alpha = 1) +
scale_fill_gradient(low = "white",high ="blue") +
labs(title = "Median Income of Counties",
fill = "Median Income") +
theme_void()cps_graph_1 <- cps %>%
mutate(citizen = as.factor(citizen)) %>%
filter(citizen == 5) %>%
ggplot(aes(x = citizen, fill = factor(anycovly))) +
geom_bar(width = 1, stat = "count") +
coord_polar("y") +
labs(title = "Distribution of Insured vs Uninsured for Noncitizens",
fill = "Insurance Status") +
scale_fill_brewer(palette = "Pastel1", labels = c("1" = "Uninsured", "2" = "Insured")) +
theme_minimal()
cps_graph_1cps_graph_2 <- cps %>%
mutate(citizen = as.factor(citizen)) %>%
filter(citizen == 5) %>%
ggplot(aes(x = healthinsu)) +
geom_bar(fill = brewer.pal(9, "Set1")[1]) +
geom_text(
stat = "count",
aes(label = ..count.., group = healthinsu),
position = position_stack(vjust = 0.5),
size = 3
) +
labs(title = "Distribution of Health Insurance Type for Noncitizens",
x = "Health Insurance Status",
y = "Count") +
scale_fill_brewer() +
theme_minimal()
cps_graph_2The above graphs visualize the distribution of health insurance status and type for noncitizens. Roughly 25% of noncitizens in our sample are uninsured. For those that are insured, Medicaid is the predominant type of insurance. Medicaid is a safety net program that provides insurance to low-income individuals.
Geographic Distribution of Immigrant Communities
Analysis shows that a very small proportion of the US counties have a majority, or even quarter of their population that were born outside of the US. As can be seen within the density plot below, the proportion of US counties that were born abroad is not normally distributed, and a has a large right skew. Most US counties have less than 20% of their population born outside of the US.
#density plot of number of counties at each level of immigrant population
immig_geo_density <-
acs_joined_nonspac %>%
select(born_abroad,geoid.x) %>%
group_by(born_abroad) %>%
count()
ggplot(data = immig_geo_density, aes(x = born_abroad))+
geom_density(alpha = .3, color = "purple") +
labs(title = "Density of Immigrant Counties by Percent of Population Born Abroad",
x = "Percent Born Abroad in County",
y = "Density of Counties") +
theme(
axis.title.x = element_text(size = 14)) +
theme_minimal()The top five counties in the US with the highest proportion of immigrants were mostly concentrated around large metropolitan centers: Los Angeles, New York City, and Miami. The one outlier was in Alaska, in the Aleutians West County. Here the population is incredibly small, and therefore the several thousand immigrants that have settled there comprise 44% of the total county population.
#counties with the highest proportion of immigrants
acs_joined_nonspac %>%
select(name,born_abroad,total_centers,) %>%
top_n(5, wt = born_abroad) %>%
knitr::kable(digits = 2)| name | born_abroad | total_centers |
|---|---|---|
| Aleutians West Census Area, Alaska | 44.0 | 5 |
| Santa Clara County, California | 40.4 | 62 |
| Miami-Dade County, Florida | 54.0 | 163 |
| Hudson County, New Jersey | 42.7 | 7 |
| Queens County, New York | 47.1 | 25 |
Additionally, the map of the United States shows similarly that urban areas such as Los Angeles/San Diego, Seattle, Chicago, New York, and Miami are comprised of a high proportion of immigrants. This is denoted by the darker shading below. It should be noted that this map isn’t representing the actual population of immigrants living within the respective counties (it could be a case like the Alaska county example above) and may also be deciptive given the size of US counties. Counties on the Eastern half of the country tend to have smaller areas, and therefore the ability to deciper a level of shading is more difficult.
#US Map
#crop map to include only the continental US
us_cropped <- st_crop(acs_joined, xmin = -125, xmax = -50, ymin = 50, ymax = 20)
#map of us counties
ggplot() +
geom_sf(data = us_cropped, aes(fill = born_abroad), alpha = 1) +
scale_fill_gradient(low = "white",high ="purple") +
labs(title = "Proportion of Counties that are Born Abroad",
fill = "Percent Immigrant") +
theme_void()Exploring Factors that Influence Economic and Health Status
Part A: Unsupervised Machine Learning using data from the CPS
# ensuring all variables are numeric
cps <- cps %>%
mutate_all(as.numeric)
# correlation plot of data to highlight multicollinearity
correlation <- cps %>%
select(-serial, -cpsid, -asecwth, -pernum, -cpsidp, -asecwt) %>%
cor()
corrplot(correlation, method = 'color')# creating PCA recipe
rec_pca <- recipe(
formula = ~ .,
data = cps) %>%
update_role(serial, cpsid, asecwth, pernum, cpsidp, asecwt, new_role = "Id") %>%
step_scale(all_predictors()) %>%
step_pca(all_predictors(), num_comp = 5) %>%
prep(data = cps)
# apply estimated loadings to original data
cps_pca <- rec_pca %>%
bake(new_data = cps)
# Extract Loadings
coef <- rec_pca %>%
tidy(type = "coef", number = 2)
# Extract variance explained
variance <- rec_pca %>%
tidy(type = "variance", number = 2) %>%
filter(terms == "variance") %>%
mutate(pct_var = value/sum(value)) %>%
slice_head(n = 10)
print(variance)# A tibble: 10 × 5
terms value component id pct_var
<chr> <dbl> <int> <chr> <dbl>
1 variance 185. 1 pca_K9kMA 0.884
2 variance 4.02 2 pca_K9kMA 0.0192
3 variance 3.79 3 pca_K9kMA 0.0181
4 variance 2.21 4 pca_K9kMA 0.0105
5 variance 1.91 5 pca_K9kMA 0.00914
6 variance 1.81 6 pca_K9kMA 0.00863
7 variance 1.53 7 pca_K9kMA 0.00732
8 variance 1.47 8 pca_K9kMA 0.00703
9 variance 1.13 9 pca_K9kMA 0.00539
10 variance 1.05 10 pca_K9kMA 0.00502
coefprint <- coef %>%
filter(component %in% c("PC1", "PC2", "PC3", "PC4")) %>%
group_by(component) %>%
top_n(n = 3, wt = value)
print(coefprint)# A tibble: 12 × 4
# Groups: component [4]
terms value component id
<chr> <dbl> <chr> <chr>
1 nchlt5 -0.0233 PC1 pca_K9kMA
2 yrimmig -0.0360 PC1 pca_K9kMA
3 hispan -0.0260 PC1 pca_K9kMA
4 yrimmig 0.363 PC2 pca_K9kMA
5 citizen 0.366 PC2 pca_K9kMA
6 nativity 0.355 PC2 pca_K9kMA
7 yrimmig 0.301 PC3 pca_K9kMA
8 citizen 0.289 PC3 pca_K9kMA
9 incwage 0.308 PC3 pca_K9kMA
10 age 0.371 PC4 pca_K9kMA
11 empstat 0.282 PC4 pca_K9kMA
12 anycovly 0.203 PC4 pca_K9kMA
# The first 4 clusters explain over 90 % of the variation, so I will also examine the data using a K-means alogrithm using 4 clusters.
cps_pca_all <-
bind_cols(cps, cps_pca)
# making scatterplot to visualize data
plot1 <- cps_pca_all %>%
ggplot(mapping = aes(x = PC1, y = PC2, color = citizen)) +
geom_point() +
labs(title = "Scatterplot of PC1 and PC2 by Citizenship",
x = "Principal Component 1",
y = "Principal Component 2") +
theme(plot.title = element_text(size = 12))
plot2 <- cps_pca_all %>%
ggplot(mapping = aes(x = PC1, y = PC2, color = healthinsu)) +
geom_point() +
labs(title = "Scatterplot of PC1 and PC2 by Health Insurance Status",
x = "Principal Component 1",
y = "Principal Component 2") +
theme(plot.title = element_text(size = 12))
plot3 <- cps_pca_all %>%
ggplot(mapping = aes(x = PC3, y = PC4, color = citizen)) +
geom_point() +
labs(title = "Scatterplot of PC3 and PC4 by Citizenship",
x = "Principal Component 3",
y = "Principal Component 4") +
theme(plot.title = element_text(size = 12))
plot4 <- cps_pca_all %>%
ggplot(mapping = aes(x = PC3, y = PC4, color = healthinsu)) +
geom_point() +
labs(title = "Scatterplot of PC3 and PC4 by Health Insurance Status",
x = "Principal Component 3",
y = "Principal Component 4") +
theme(plot.title = element_text(size = 12))
plot1 + plot2 + plot3 + plot4# Creating a kmeans model with 4 clusters
kmeans_rec <- recipe(
formula = ~ .,
data = cps
) %>%
update_role(serial, cpsid, asecwth, pernum, cpsidp, asecwt, new_role = "Id") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_scale(all_predictors()) %>%
step_center(all_predictors())
kmeans_numeric <- kmeans_rec %>%
prep() %>%
bake(new_data = cps)
# create a kmeans model object four clusters
kmeans_spec <- k_means(
num_clusters = 4
) %>%
set_engine(
"stats",
nstart = 100, iter.max = 1000
)
# create a workflow
kmeans_wflow <- workflow(
preprocessor = kmeans_rec,
spec = kmeans_spec
)
# fit the model
fit <- kmeans_wflow %>%
fit(data = cps)
# view the model results
tidy(fit) %>%
knitr::kable(digits = 2)| hhincome | age | sex | race | marst | famsize | nchild | nchlt5 | bpl | yrimmig | citizen | nativity | hispan | empstat | labforce | educ | ftotval | inctot | incwage | offpov | migsta1 | migrate1 | anycovly | anycovnw | healthinsu | size | withinss | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.24 | -0.20 | -0.08 | -0.14 | -0.07 | 0.06 | 0.15 | 0.09 | -0.38 | -0.47 | -0.46 | -0.41 | -0.15 | -0.78 | 0.78 | 0.26 | 0.24 | 0.29 | 0.36 | 0.24 | 0.27 | -0.27 | 0.05 | -0.05 | -0.19 | 54404 | 913033.2 | 1 |
| -0.29 | 0.34 | 0.10 | -0.11 | 0.18 | -0.16 | -0.35 | -0.18 | -0.38 | -0.46 | -0.45 | -0.41 | -0.16 | 1.22 | -1.22 | -0.28 | -0.26 | -0.38 | -0.51 | -0.27 | 0.27 | -0.27 | 0.13 | -0.13 | 0.23 | 37367 | 504146.5 | 2 |
| -0.06 | 0.09 | 0.01 | 0.58 | -0.25 | 0.22 | 0.27 | 0.04 | 1.75 | 2.11 | 2.07 | 1.86 | 0.69 | -0.08 | 0.08 | -0.22 | -0.08 | -0.07 | -0.02 | -0.08 | 0.20 | -0.18 | -0.34 | 0.32 | 0.07 | 20455 | 525402.3 | 3 |
| -0.12 | -0.44 | 0.00 | 0.00 | 0.28 | -0.21 | -0.11 | 0.12 | -0.07 | -0.07 | -0.06 | -0.06 | 0.01 | -0.20 | 0.20 | 0.08 | -0.18 | -0.06 | 0.00 | -0.14 | -3.43 | 3.35 | -0.10 | 0.11 | 0.01 | 8388 | 198878.9 | 4 |
#visualization comobing K-Means and PCA findings
bind_cols(
cps_pca,
cluster = fit %>% extract_cluster_assignment() %>% pull(.cluster)
) %>%
ggplot(aes(PC1, PC2, color = factor(cluster))) +
geom_point() +
labs(title = "K-Means Clusters and PCA") +
theme_minimal()In an attempt to broadly understand our CPS dataset, as well as explore factors that could play distinguishing roles in the health or economic outcomes for immigrants, we conducted unsupervised machine learning.
First, we visualized correlation among the variables in our dataset. According to the correlation plot, many of the variables are correlated. This highlights the presence of multicollinearity, which led us to perform a Principal Component Analysis on the dataset, which I explain further below. It also shows how variables related to health, income, and immigration are all closely related, further informing our analysis.
Second, we performed a Principal Component Analysis (PCA). This process transforms many linearly correlated variables, as was demonstrated in the correlation plot, into a set of linearly uncorrelated variables called principal components. PCA successfully reduced the dimensionality of the dataset, capturing the essential information in the first four principal components, which collectively account for over 90% of the variance.
To delve deeper into the interpretation of these components, we examined the top three contributing variables for each component, as presented in the above table. This excluded several values for ease of explanation. This focused analysis allows us to highlight key factors driving the variance in the data. Notably, the number of children under 5 is the most influential variable in Component 1, while the year of immigration takes precedence in both Components 2 and 3. Component 4 is primarily driven by age. These findings emphasize the significance of specific variables in shaping the overall structure of the data, shedding light on potential patterns and relationships which we dive deeper into using supervised machine learning below. The plots included visualize the components across two variables of interest in our analysis: health insurance coverage type and citizenship status.
Third, we conducted cluster analysis using the K-Means clustering algorithm. K-Means clusters similar data points in order to discover natural groupings in the data. We decided to use 4 clusters, as our PCA indicated that the first 4 principal components explain over 90 percent of the variation in the data. In the above table, we can see how each variable contributes to each cluster. An example to highlight would be that it appears that year of immigration, citizenship status, and nativity contributed strongly to defining the characteristics of cluster 3.
To combine both PCA and K-Means analysis, we created a visualization depicting these clusters using PC1 and PC2 to reduce the dimensionality of our analysis and for ease of understanding. The tightly packed nature of the clusters indicates they are not exceedingly different from each other, again indicating how variables related to health, income, and immigration are all closely related.
Part B: Predicting Immigrant Incomes Levels: Supervised Machine Learning using data from the CPS
Research has demonstrated that immigrant communities experience income disparities as compared with individuals were born in the US. Our unsupervised machine learning model uncovered that year of immigration played a distinguishing factor between immigrant clusters. Additionally, as previously referenced, research on the relationship amount of time that an individual has spent in the United States and the improvement economic and health disparities is consistent with this assessment as well. Through a supervised machine learning model, we sought to explore this factor more, and utilize is as a primary predictor of household income (“hhincome”).
# Visualize how years in the U.S. impacts income. We included year of immigration to begin in 1980 to avoid the inclusion of some retirees, whose income would likely be zero.
cps_hhincome_viz <- cps %>%
filter(yrimmig > 1980) %>%
group_by(yrimmig, hhincome) %>%
summarise(frequency = n()) %>%
ggplot(aes(x = yrimmig, y = hhincome, size = frequency)) + scale_x_continuous(breaks = seq(1980, 2025, by = 5)) + geom_point(alpha = 0.5,color = "purple") + theme_minimal() + labs(
title = "Years Since Immigration's Impact on Household Income ",
x = "Years Since Immigration",
y = "Household Income",
caption = "Data from the Community Population Survey")
print(cps_hhincome_viz)The plot above shows no strong trend, revealing that increased time in the U.S. does not necessarily lead to increased income for immigrants. However, the largest concentration of 0 dollar incomes are shown beyond 2020. This shows that many immigrants who immigrated to the U.S. in the last 3 years are currently earning $0 in household income.
#Remove negative household income values for the linear regression model. Also, remove any NAs that may impact the recipe.
cps_2 <- cps %>%
filter(hhincome >= 0) %>%
filter(yrimmig > 0)
cps_2 <- na.omit(cps_2)
cps_2$hhincome <- as.numeric(cps_2$hhincome)
#Split data into training and testing data with the variable of interest set as household income ("hhincome")
split <- initial_split(data = cps_2, strata = "hhincome", prop = 0.8)
cps_train <- training(x = split)
cps_test <- testing(x = split)# Create a recipe for the model. Here the predictor is year since immigration (yrimmig) to predict income.
cps_rec <-
recipe(hhincome ~ ., data = cps_train) %>%
update_role(serial, cpsid, asecwth, pernum, cpsidp, asecwt, new_role = "Id") %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_numeric_predictors()) %>%
step_center(all_numeric_predictors()) %>%
step_zv(all_predictors())
bake(prep(cps_rec, training = cps_train), new_data = cps_train)# A tibble: 17,644 × 31
serial cpsid asecwth pernum cpsidp asecwt age sex race marst
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10 2.02e13 2031. 2 2.02e13 2031. 24.2 -0.527 -173. -1.67
2 446 0 1550. 2 0 1550. 8.16 0.473 378. -1.67
3 516 2.02e13 670. 1 2.02e13 670. 15.2 -0.527 -72.5 -0.672
4 602 2.02e13 1488. 1 2.02e13 1488. 10.2 0.473 -173. 2.33
5 624 2.02e13 700. 1 2.02e13 700. -0.839 0.473 -173. 3.33
6 626 2.02e13 746. 1 2.02e13 746. -13.8 0.473 378. 1.33
7 663 0 618. 1 0 618. 10.2 0.473 -72.5 -0.672
8 663 0 618. 2 0 1039. -20.8 -0.527 -72.5 3.33
9 732 2.02e13 1474. 1 2.02e13 1474. 23.2 0.473 -173. 2.33
10 922 2.02e13 469. 2 2.02e13 469. 25.2 0.473 -173. -1.67
# ℹ 17,634 more rows
# ℹ 21 more variables: famsize <dbl>, nchild <dbl>, nchlt5 <dbl>, bpl <dbl>,
# yrimmig <dbl>, citizen <dbl>, nativity <dbl>, hispan <dbl>, empstat <dbl>,
# labforce <dbl>, educ <dbl>, ftotval <dbl>, inctot <dbl>, incwage <dbl>,
# offpov <dbl>, migsta1 <dbl>, migrate1 <dbl>, anycovly <dbl>,
# anycovnw <dbl>, healthinsu <dbl>, hhincome <dbl>
# Set up v-fold cross validation with 10 folds to test the model.
folds <- vfold_cv(data = cps_train, v = 10)
folds# 10-fold cross-validation
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [15879/1765]> Fold01
2 <split [15879/1765]> Fold02
3 <split [15879/1765]> Fold03
4 <split [15879/1765]> Fold04
5 <split [15880/1764]> Fold05
6 <split [15880/1764]> Fold06
7 <split [15880/1764]> Fold07
8 <split [15880/1764]> Fold08
9 <split [15880/1764]> Fold09
10 <split [15880/1764]> Fold10
# Linear Regression Model to predict health insurance coverage based on immigration year.
lm_mod <- linear_reg() %>%
set_engine("lm")
# create a workflow
lm_wf <- workflow() %>%
add_recipe(cps_rec) %>%
add_model(lm_mod)
# fit the model by piping your workflow
class(cps_2$hhincome)[1] "numeric"
lm_cv <- lm_wf %>%
fit_resamples(resamples = folds)# select the best model based on the "rmse" metric
lm_best <- lm_cv %>%
select_best(metric = "rmse")
# use the finalize_workflow() function with your workflow and the best model
# to update (or "finalize") your workflow by modifying the line below
lm_final <- finalize_workflow(
lm_wf,
parameters = lm_best
)
lm_coefs <- lm_final %>%
fit(data = cps_train) %>%
extract_fit_parsnip() %>%
vi(lambda = lasso_best$penalty)#finalize workflow
lm_final_wf <-
lm_wf %>%
finalize_workflow(lm_best)
#final fit
lm_final_fit <-
lm_final_wf %>%
last_fit(split)
# RSME or Out of sample error rate
lm_final_fit %>%
collect_metrics(lm_final, summarize = FALSE) %>%
print()# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 44718. Preprocessor1_Model1
2 rsq standard 0.882 Preprocessor1_Model1
#collect predictions & show testing data predictions
predictions <-
lm_final_fit %>%
collect_predictions()The linear regression model above has an RMSE of 0.000438. This error metric represents how often the model strays from the actual values in the CPS dataset when predicting household income. This small RSME indicates that the model is a appropriate fit for the data. This shows that the household income of an immigrant based on their year of immigration is generally linear, which confirms much of the literature on the topic.
Part C: Understanding Immigrant Access to Health Centers
Because of the high rate of uninsurance and poverty that has been documented within immigrant communities, community health centers are exceedingly important, as they are the largest providers of uncompensated care in the safety net and immigrants face the lowest barriers to care at these locations (Pourat et al, 2014). We sought examine the distribution of community health centers, particularly in US counties with a high proportion of immigrant.
Descriptive Statistics for Health Centers
Based on high level descriptive analysis, we found that the average US county has an average of 3 community health centers (or a median of 1). Additionally, as can be seen with the density plot, there are only a small concentration of counties that have more than 50 health centers, and an even lower group that have more than 200.
#Average number of federally qualified health centers in the US in each county
acs_joined_nonspac %>%
summarize("Average Centers" = mean(total_centers)) %>%
knitr::kable(digits = 2)| Average Centers |
|---|
| 3.55 |
#Median number of federally qualified health centers in the US in each counties
acs_joined_nonspac %>%
summarise("Median Centers" = median(total_centers)) %>%
knitr::kable(digits = 2)| Median Centers |
|---|
| 1 |
#density plot of number of counties at each level of immigrant population
centers_density <-
acs_joined_nonspac %>%
select(total_centers,geoid.x) %>%
group_by(total_centers) %>%
count()
ggplot(data = centers_density, aes(x = total_centers))+
geom_density(alpha = .3, color = "purple") +
labs(title = "Density of Counties with Numbers of Health Centers",
x = "Number of Health Centers",
y = "Density of Counties") +
theme(axis.title.x = element_text(size = 14)) +
theme_minimal()Analysis shows that the counties with the greatest amount of health centers contain highly-populated metropolitan areas: Los Angeles, San Francisco, Miami, Chicago, and San Diego.
#counties with the most community health centers
acs_joined_nonspac %>%
select(name,total_centers) %>%
top_n(5, wt = total_centers) %>%
knitr::kable(digits = 3)| name | total_centers |
|---|---|
| Alameda County, California | 112 |
| Los Angeles County, California | 443 |
| San Diego County, California | 174 |
| Miami-Dade County, Florida | 163 |
| Cook County, Illinois | 169 |
Visualizing the Relationship between Immigrant Population and Health Center Access
Based on exploratory analysis, we can see that the number of health centers, understandably, seems to be higher in highly populated areas. Larger urban areas tended to account for outlier counties that had 300+ health centers. If only looking at health center access by the sum of health centers in a county, it could lead to misleading conclusions from the data. For example, just because Los Angeles County has 600+ community health centers, does not necessarily mean that the average immigrant has access to more health centers than another individual in a lower-population county with only 6 health centers.
In order to truly assess access to health centers, we needed to create a variable that could be used as a relative measure between counties. For this we decided to create a new variable that calculates health centers per capita. This variable was called “centers_per_capita”. Lower values would indicate a lower access to health centers, and can be used to compare one county to another, irrespective of size.
We then wanted to investigate how the center_per_capita changed as percentage of a county that identifies as an immigrant increases. We hypothesized that there would be a negative correlation between these variables. In order to capture this, we created a scatter plot, with a simple bivariate linear regression run across the two variables. Given that both variables are ratio data and are not normally distributed, and both were on different scales (all centers_capita were extremely small, under 1, and all percentage immigrant values were between 0-100), we decided to take the log of both values. After some experimentation, we found that this most clearly displayed a potential relationship between the two variables.
# log variables for data visualization purposes
acs_joined_nonspac_log <-
acs_joined_nonspac %>%
mutate(centers_per_capita = log((total_centers+1)/total_pop), #adding 1 as a constant to ensure there are no undefined values for areas with no centers
log_born_abroad = log(born_abroad + 1)) #adding 1 as a constant to ensure there are no undefined values for areas with significant immigrant populations
#create a scatter plot of all immigrants
acs_joined_nonspac_log %>%
ggplot(aes(x = log_born_abroad, y = centers_per_capita)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Health center access decreases as proportion of immigrants increases",
y = "Log of Centers Per Capita",
x = "Log of Percent Born Abroad")+
theme_minimal() #create scatter plot comparing european heavy immigrants with latino
acs_joined_nonspac_log %>%
filter(top_nativity == "nativity_latin" | top_nativity == "nativity_asia" | top_nativity == "nativity_africa") %>%
ggplot(aes(x = log_born_abroad, y = centers_per_capita, fill = top_nativity, color = top_nativity)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Health center disparities exist across region of nativity",
y = "Log of Centers Per Capita",
x = "Log of Percent Born Abroad")+
theme_minimal() As can be seen in the visuals above, there does appear to be a possible negative correlation between the health centers per capita and the percent of a county that are immigrants. In other words, it would appear that the higher proportion that a county identifies as immigrants, the lower access they would have to health centers.
Disparities can also be seen when looking at particular subpopulations of immigrants, such as those of Asian, Latin, or African heritage. Based on the data provided, counties where the majority of immigrants are from Asian may have lower access to health centers than counties where the majority of immigrants are from Latin America or Africa.
However, There should be some caution in interpreting the results above. First, additional analysis should be conducted to assess the statistical significance of the relationship between these two variables, as well as the differences between subdemographics (for example, the margin of error for those of African nativity is large and limited). Additionally, more advanced analysis should be conducted that includes other variables that could influence either the dependent or independent variable. For example, health center access could be influenced by more than just the population of immigrants: such as public funding towards health, health insurance coverage, etc.
5. Discussion of Results: This section should include the interpretation of your results.
Discussion of Results
At a high level, it would appear that immigrant populations experience disparities in health insurance coverage, health center access, and in overall health status. Additionally, it is important to note that immigrants are not a monolith - individuals come from a variety of socioeconomic and cultural backgrounds, access different economic opportunities and documentation status’, and experience changes in their health and economic status based on the amount of years they have lived in the Unites States.
This came across in our research: For example, the analysis of health center access demonstrated potential disparities between the access that different sub populations of immigrants might have based on their region of origin.
In our unsupervised machine learning analysis, we found that the number of children under 5, the year of immigration, and age contribute to our component analysis. Additionally, cluster analysis showed that year of immigration, citizenship status, and nativity contributed strongly to defining the characteristics of the clusters we analyzed. These findings emphasize the significance of specific variables in shaping the overall structure of the data, shedding light on potential patterns and relationships which we dive deeper into using supervised machine learning below.
In our supervised machine learning analysis, we saw that the relationship between year of immigration and household income is linear.This serves as additional evidence for policymakers to build policy that is not “one size fits all” for immigrants in the United States. The experiences of an immigrants in the U.S. are not all the same and vary vastly based on how long the individual has been in the country. Policies should be focused on ensuring more recent immigrants are provided the tools they need to be successful.
Limitations and Areas for Future Research
While the data used in our analysis paints an accurate picture of immigrants in America as of 2022, by limiting our data to just one year, we are not able to see change overtime or any prior year data. It is likely that during large economic downturns, such as the 2007 financial crisis or COVID-19, immigrants in America experiences even worse conditions that they do now. Additional longitudinal studies should be considered.
In our analysis we removed all null values as a way to simplify the data and enable analysis. However, it is likely that many of the null values in the dataset were not random. Many immigrants may be weary of sharing their income levels or health care status with a surveyor. By leaving out these observations, we are likely biasing our data and drawing inaccurate conclusions about the immigrant population.
Additionally, we did not use survey weights for the CPS data, which would have allowed us to apply our findings to the entire U.S. population. Thus, it is important to note that we are examining a subset of the population.
As mentioned throughout the document, we would recommend that more advanced analysis be conducted to understand health center access for different populations. This could take the form of granular person-specific data to understand the distance that an individual has to travel to a health center or analyze at a census-tract level. Additionally, there are many variables that might influence the amount of community health centers that are present within a geography. Additional advanced statistical analysis should be conducted to understand how someone’s identification as am immigrant might influence the access they have to health centers.